Fake or Real News? An NLP Model

Purpose

Fake news is a big topic, and with machine learning taking off the way it is, I imagine it will only become more important to be able to distinguish between fake and real news sources. I came across this dataset on Kaggle and thought it would be fun to try to build an NLP model that could tell the two apart.

Setting up the Script

First steps are pretty intuitive:

  • Call needed libraries
  • Read in necessary data

library(dplyr)
library(plotly)
## Warning: package 'ggplot2' was built under R version 4.4.3
library(stopwords)
## Warning: package 'stopwords' was built under R version 4.4.3
library(stringr)
library(tm)
## Warning: package 'tm' was built under R version 4.4.3
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.4.3
library(textstem)
## Warning: package 'textstem' was built under R version 4.4.3
## Warning: package 'koRpus.lang.en' was built under R version 4.4.3
## Warning: package 'koRpus' was built under R version 4.4.3
## Warning: package 'sylly' was built under R version 4.4.3
library(tidyr)
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.4.3
library(splitTools)
## Warning: package 'splitTools' was built under R version 4.4.3
library(keras)
## Warning: package 'keras' was built under R version 4.4.3
library(Matrix)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.4.3
library(tensorflow)
## Warning: package 'tensorflow' was built under R version 4.4.3
library(naivebayes)
## Warning: package 'naivebayes' was built under R version 4.4.3
#Import Data
fake_0 <- read.csv("Fake.csv")
real_0 <- read.csv("True.csv")

Exploring the Data

We need to actually look at the data, and ensure we are working with the same parameters in both files.

head(fake_0, 1)
##                                                                             title
## 1  Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            text
## 1 Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and  the very dishonest fake news media.  The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year,  President Angry Pants tweeted.  2018 will be a great year for America! As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year. 2018 will be a great year for America!  Donald J. Trump (@realDonaldTrump) December 31, 2017Trump s tweet went down about as welll as you d expect.What kind of president sends a New Year s greeting like this despicable, petty, infantile gibberish? Only Trump! His lack of decency won t even allow him to rise above the gutter long enough to wish the American citizens a happy new year!  Bishop Talbert Swan (@TalbertSwan) December 31, 2017no one likes you  Calvin (@calvinstowell) December 31, 2017Your impeachment would make 2018 a great year for America, but I ll also accept regaining control of Congress.  Miranda Yaver (@mirandayaver) December 31, 2017Do you hear yourself talk? When you have to include that many people that hate you you have to wonder? Why do the they all hate me?  Alan Sandoval (@AlanSandoval13) December 31, 2017Who uses the word Haters in a New Years wish??  Marlene (@marlene399) December 31, 2017You can t just say happy new year?  Koren pollitt (@Korencarpenter) December 31, 2017Here s Trump s New Year s Eve tweet from 2016.Happy New Year to all, including to my many enemies and those who have fought me and lost so badly they just don t know what to do. Love!  Donald J. Trump (@realDonaldTrump) December 31, 2016This is nothing new for Trump. He s been doing this for years.Trump has directed messages to his  enemies  and  haters  for New Year s, Easter, Thanksgiving, and the anniversary of 9/11. pic.twitter.com/4FPAe2KypA  Daniel Dale (@ddale8) December 31, 2017Trump s holiday tweets are clearly not presidential.How long did he work at Hallmark before becoming President?  Steven Goodine (@SGoodine) December 31, 2017He s always been like this . . . the only difference is that in the last few years, his filter has been breaking down.  Roy Schulze (@thbthttt) December 31, 2017Who, apart from a teenager uses the term haters?  Wendy (@WendyWhistles) December 31, 2017he s a fucking 5 year old  Who Knows (@rainyday80) December 31, 2017So, to all the people who voted for this a hole thinking he would change once he got into power, you were wrong! 70-year-old men don t change and now he s a year older.Photo by Andrew Burton/Getty Images.
##   subject              date
## 1    News December 31, 2017
head(real_0,1)
##                                                              title
## 1 As U.S. budget fight looms, Republicans flip their fiscal script
##                                                                             text
## 1 WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense “discretionary” spending on programs that support education, scientific research, infrastructure, public health and environmental protection. “The (Trump) administration has already been willing to say: ‘We’re going to increase non-defense discretionary spending ... by about 7 percent,’” Meadows, chairman of the small but influential House Freedom Caucus, said on the program. “Now, Democrats are saying that’s not enough, we need to give the government a pay raise of 10 to 11 percent. For a fiscal conservative, I don’t see where the rationale is. ... Eventually you run out of other people’s money,” he said. Meadows was among Republicans who voted in late December for their party’s debt-financed tax overhaul, which is expected to balloon the federal budget deficit and add about $1.5 trillion over 10 years to the $20 trillion national debt. “It’s interesting to hear Mark talk about fiscal responsibility,” Democratic U.S. Representative Joseph Crowley said on CBS. Crowley said the Republican tax bill would require the  United States to borrow $1.5 trillion, to be paid off by future generations, to finance tax cuts for corporations and the rich. “This is one of the least ... fiscally responsible bills we’ve ever seen passed in the history of the House of Representatives. I think we’re going to be paying for this for many, many years to come,” Crowley said. Republicans insist the tax package, the biggest U.S. tax overhaul in more than 30 years,  will boost the economy and job growth. House Speaker Paul Ryan, who also supported the tax bill, recently went further than Meadows, making clear in a radio interview that welfare or “entitlement reform,” as the party often calls it, would be a top Republican priority in 2018. In Republican parlance, “entitlement” programs mean food stamps, housing assistance, Medicare and Medicaid health insurance for the elderly, poor and disabled, as well as other programs created by Washington to assist the needy. Democrats seized on Ryan’s early December remarks, saying they showed Republicans would try to pay for their tax overhaul by seeking spending cuts for social programs. But the goals of House Republicans may have to take a back seat to the Senate, where the votes of some Democrats will be needed to approve a budget and prevent a government shutdown. Democrats will use their leverage in the Senate, which Republicans narrowly control, to defend both discretionary non-defense programs and social spending, while tackling the issue of the “Dreamers,” people brought illegally to the country as children. 
Trump in September put a March 2018 expiration date on the Deferred Action for Childhood Arrivals, or DACA, program, which protects the young immigrants from deportation and provides them with work permits. The president has said in recent Twitter messages he wants funding for his proposed Mexican border wall and other immigration law changes in exchange for agreeing to help the Dreamers. Representative Debbie Dingell told CBS she did not favor linking that issue to other policy objectives, such as wall funding. “We need to do DACA clean,” she said.  On Wednesday, Trump aides will meet with congressional leaders to discuss those issues. That will be followed by a weekend of strategy sessions for Trump and Republican leaders on Jan. 6 and 7, the White House said. Trump was also scheduled to meet on Sunday with Florida Republican Governor Rick Scott, who wants more emergency aid. The House has passed an $81 billion aid package after hurricanes in Florida, Texas and Puerto Rico, and wildfires in California. The package far exceeded the $44 billion requested by the Trump administration. The Senate has not yet voted on the aid. 
##        subject               date
## 1 politicsNews December 31, 2017
colnames(fake_0)
## [1] "title"   "text"    "subject" "date"
colnames(real_0)
## [1] "title"   "text"    "subject" "date"
substr(real_0$text, 1, 100)[1:5]
## [1] "WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who voted"
## [2] "WASHINGTON (Reuters) - Transgender people will be allowed for the first time to enlist in the U.S. m"
## [3] "WASHINGTON (Reuters) - The special counsel investigation of links between Russia and President Trump"
## [4] "WASHINGTON (Reuters) - Trump campaign adviser George Papadopoulos told an Australian diplomat in May"
## [5] "SEATTLE/WASHINGTON (Reuters) - President Donald Trump called on the U.S. Postal Service on Friday to"

It seems that the real data ALL comes from Reuters and has "LOCATION (Reuters) -" at the beginning of each text. While keeping that would make for a nearly perfectly accurate model, it isn't realistic, so unfortunately we will have to filter it out and accept a model that does not perform perfectly (I know, I know. But I'm no cheater).

real <- real_0 %>%
  mutate("text" = gsub("^.+\\(reuters\\) ", "", tolower(text)))

substr(real$text, 1, 100)[1:5]
## [1] "- the head of a conservative republican faction in the u.s. congress, who voted this month for a hug"
## [2] "- transgender people will be allowed for the first time to enlist in the u.s. military starting on m"
## [3] "- the special counsel investigation of links between russia and president trump’s 2016 election camp"
## [4] "- trump campaign adviser george papadopoulos told an australian diplomat in may 2016 that russia had"
## [5] "- president donald trump called on the u.s. postal service on friday to charge “much more” to ship p"

We now need to look for missing values and deal with them as needed.

summary(fake_0)
##     title               text             subject              date          
##  Length:23481       Length:23481       Length:23481       Length:23481      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
summary(real_0)
##     title               text             subject              date          
##  Length:21417       Length:21417       Length:21417       Length:21417      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character

There are technically no NA values, but oftentimes with text data, "NA"s are actually empty strings, so that should also be checked for.

emptyFake <- fake_0 %>%
  filter((title == "") | (text == "") | (subject == ""))

emptyReal <- real %>%
  filter((title == "") | (text == "") | (subject == ""))

print(paste(nrow(emptyFake), "missing from Fake", nrow(emptyReal), "missing from Real"))
## [1] "0 missing from Fake 0 missing from Real"

The check comes back clean: neither file has empty strings in these columns. We probably won't use subjects or dates anyway (see ahead), so even if a few values were off, it's no matter.

Combining and Visualizing the Data

We still need to look closer at the text; more likely than not, rigorous text cleaning will need to happen. But I kinda love cleaning text data. I just think it's so cool.

First, the data needs to be concatenated, and the dates still need to be cleaned up.

concat <- fake_0 %>%
  mutate("target" = "Fake",
         "date" = format(strptime(date, format = "%d-%b-%y"), "%Y-%m-%d")
         ) %>% #Clean up the date column
  rbind.data.frame(real %>% mutate("target" = "Real",
                                     "date" = format(strptime(date, format = "%B %d, %Y"), "%Y-%m-%d")
                                     )
                   ) #The files have different date formats

Let’s look at some distribution stuff before we start cleaning the text, starting with the distribution of news subjects between the two news types. I want to look at the distributions of each, but also compare them a little more closely. In this particular case, the csv file was importing strangely and treating some commas from the text column as delimiters. This was only an issue for the fake news csv, which has much dirtier data than the real news csv.

To work around it, I went into the csv itself and determined what the subjects were, and manually put them into respective lists.

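A rough sketch of how that subject comparison could be drawn is below, using plotly the same way the later charts are built. It skips the manual subject curation and simply counts the raw subject labels, so treat it as illustrative rather than the exact chart described above.

#Sketch only: distribution of subjects by news type
subjectCounts <- concat %>%
  count(target, subject, name = "n")

subjectCounts %>%
  plot_ly(x = ~subject, y = ~n, color = ~target, type = "bar",
          colors = c("#2AB285", "#965483")) %>%
  layout(barmode = "group",
         xaxis = list(title = "Subject"),
         yaxis = list(title = "Count"),
         title = "Distribution of Subjects by News Type",
         legend = list(title = list(text = "News Type")))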

It’s a shame the real data's subject categories are so limited, but it does seem like true news is more likely to report on world affairs.

I also want to look at the dates that the stories for each type were published. Let’s look at it by month:
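Roughly, that monthly comparison could be drawn from the cleaned date column like this (a sketch only, assuming the date parsing above succeeded; plotly again for consistency):

#Sketch only: number of stories per month by news type
monthlyCounts <- concat %>%
  filter(!is.na(date)) %>%
  mutate("month" = format(as.Date(date), "%Y-%m")) %>%
  count(target, month, name = "n")

monthlyCounts %>%
  plot_ly(x = ~month, y = ~n, color = ~target, type = "scatter", mode = "lines",
          colors = c("#2AB285", "#965483")) %>%
  layout(xaxis = list(title = "Month"),
         yaxis = list(title = "Stories Published"),
         title = "Stories Published per Month by News Type",
         legend = list(title = list(text = "News Type")))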

We see a jump in fake news when the election year starts, and a drop-off around November 2017, right when real news ramps up. We also see that we don't have real news data from before the 2016 election, which could skew our model.

Personally, the amount of fake vs. real news is somewhat alarming. I wonder if this dataset is simply skewed toward fake datapoints, rather than there actually having been more fake news during the election year.

Cleaning the Text Data

Combine the title and text columns, as they are related and we don't want to create twice as many lemmas/corpora.

Stop words also need to be defined and cleaned out. This will help us look at the more important words with actual meaning (we don’t care about words like “he”, “she”, “the”).

The first step is to create a corpus, i.e., the collection of the texts. I'll be using the tm library.

#combine columns, select only what we are interested in
cleanData <- concat %>%
  mutate("combText" = paste(title, text)) %>%
  mutate("target" = ifelse(target == "Fake", 1,0)) %>%
  select(combText, target) #Since the date range is larger for fake, and since fake has more subject identifiers than real, I am taking those out


#Create corpus and clean the text accordingly
corp <- VCorpus(VectorSource(cleanData$combText))

stopWordsFun <- function(x) removeWords(x, c(stopwords("en"), "just", "the", "they", "get", "this", "for"))
funs <- list(content_transformer(tolower),
             removePunctuation,
             removeNumbers,
             stopWordsFun,
             removePunctuation,
             stripWhitespace
             )

corp <- tm_map(corp, FUN = tm_reduce, tmFuns = funs)
corp[["1"]][["content"]]
## [1] " donald trump sends out embarrassing new year’s eve message this  disturbing donald trump  couldn t wish  americans  happy new year  leave    instead    give  shout    enemies haters    dishonest fake news media the former reality show star   one job     couldn t   as  country rapidly grows stronger  smarter i want  wish    friends supporters enemies haters  even   dishonest fake news media  happy  healthy new year president angry pants tweeted  will   great year  america as  country rapidly grows stronger  smarter i want  wish    friends supporters enemies haters  even   dishonest fake news media  happy  healthy new year  will   great year  america donald j trump realdonaldtrump december  trump s tweet went    welll   d expectwhat kind  president sends  new year s greeting like  despicable petty infantile gibberish only trump his lack  decency won t even allow   rise   gutter long enough  wish  american citizens  happy new year bishop talbert swan talbertswan december  no one likes  calvin calvinstowell december  your impeachment  make   great year  america  i ll also accept regaining control  congress miranda yaver mirandayaver december  do  hear  talk when    include  many people  hate     wonder why     hate  alan sandoval alansandoval december  who uses  word haters   new years wish marlene marlene december  you can t  say happy new year koren pollitt korencarpenter december  here s trump s new year s eve tweet  happy new year   including   many enemies     fought   lost  badly   don t know    love donald j trump realdonaldtrump december  this  nothing new  trump he s     yearstrump  directed messages   enemies  haters  new year s easter thanksgiving   anniversary   pictwittercomfpaekypa daniel dale ddale december  trump s holiday tweets  clearly  presidentialhow long   work  hallmark  becoming president steven goodine sgoodine december  he s always  like       difference     last  years  filter   breaking  roy schulze thbthttt december  who apart   teenager uses  term haters wendy wendywhistles december  he s  fucking  year old who knows rainyday december  so    people  voted    hole thinking   change   got  power   wrong yearold men don t change  now  s  year olderphoto  andrew burtongetty images"

We need to create lemmas from the corpus now that the data is cleaned. Lemmatization essentially breaks words down into their base form (ex: "scripted" vs. "script"). This reduces the number of words the model needs to train on without sacrificing much accuracy. We can use a combination of the textstem library (which has lemmatization capability) and the tm library to apply the transformation across all strings in our corpus.

corpLemma <- tm_map(corp, content_transformer(lemmatize_strings))

Next, we create the Document Term Matrix (DTM), a matrix with one column per word and one row per document, where each cell holds how often that word occurs in that document. This matrix has MANY zero values, so we trim the sparsest terms and then weight the rest via TF-IDF, which scores how important an individual term is relative to the entire corpus. It combines how frequent a term is within each individual document (Term Frequency) with how rare that term is across the whole corpus (Inverse Document Frequency).
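As a toy illustration of the weighting (not part of the pipeline, and the tiny matrix below is made up), tm's weightTfIdf multiplies each within-document term frequency by a log2-based inverse document frequency:

#Toy example: hand-computed TF-IDF on a made-up two-document count matrix
toyCounts <- matrix(c(3, 0, 1,
                      0, 2, 1),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("doc1", "doc2"), c("trump", "budget", "say")))

tf    <- toyCounts / rowSums(toyCounts)                 #Term Frequency: share of each document
idf   <- log2(nrow(toyCounts) / colSums(toyCounts > 0)) #Inverse Document Frequency: rarity across documents
tfidf <- sweep(tf, 2, idf, `*`)                         #Terms frequent in one doc but rare overall score highest
tfidf

Note how "say", which appears in every toy document, ends up with a weight of zero; that is exactly the behavior we want for filler words that slip past stop word removal.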

Note that I tested this with normalized data first, but the model did not perform as well as it did with non-normalized data.

dtm <- DocumentTermMatrix(corpLemma)
dtm <- removeSparseTerms(dtm, sparse = 0.98) #Could probably neglect this, but I have memory restrictions to work with
inspect(dtm)
## <<DocumentTermMatrix (documents: 44898, terms: 1511)>>
## Non-/sparse entries: 4687677/63153201
## Sparsity           : 93%
## Maximal term length: 15
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs    make one people president republican say state the trump will
##   12407    6   7      6         8          5   1    45  11     1    3
##   17850    6   8      6         9          6   1    46  15     5    4
##   19622    6   7      6         8          5   1    45  11     1    3
##   21979    4  18      4         2          0  25    15  31     0   10
##   22304    7  25      5         4          0  13     7  45     6    7
##   22348    8  13      4         4          2  13     7  57    19    4
##   22762    4  18      4         2          0  25    15  31     0   10
##   23087    7  25      5         4          0  13     7  45     6    7
##   23131    8  13      4         4          2  13     7  57    19    4
##   9732     6   8      6         9          6   1    46  15     5    4
dtmNorm <- weightTfIdf(dtm, normalize=FALSE)
#dtmTFIDF <- weightTfIdf(dtm, normalize=FALSE)
#For some reason, not all the stop words were getting taken out, so I did some manually here just to be sure
dtmNorm <- dtmNorm[, !colnames(dtmNorm) %in% c(stopwords("en"), "just", "the", "they", "get", "this", "for")]
#dtmNorm <- scale(dtmNorm) #Normalize after removing other stop words
#inspect(dtmNorm)
#inspect(dtmTFIDF)

wordDF <- tidytext::tidy(dtmNorm) %>%
  mutate("document" = as.numeric(document)) %>%
  left_join(concat %>% 
              select(target) %>%
              mutate("document" = row_number()), by = "document"
            ) %>%
  group_by(target, term) %>%
  mutate("totCount" = sum(count)) %>%
  distinct(term, target, totCount) %>%
  ungroup() %>%
  filter(!term %in% c(stopwords("en"), "just", "the", "they", "get", "this", "for"))

Visualization of the DTM

Now that we have our individual terms and their respective targets, we can visualize them. The following charts include:

  • Word cloud of fake news
  • Word cloud of real news
  • Bar chart of the most common words across the whole DTM
  • Bar chart of most common words in real data, with a comparison to the same word in the fake set
  • Bar chart of most common words in fake data, with a comparison to the same word in the real set
#Now that we have our individual terms and their respective targets, we can visualize them

#Real news word cloud
wordDF %>%
  filter(target == "Real") %>%
  select(-target) %>%
  arrange(desc(totCount)) %>%
  wordcloud2(shuffle=FALSE)
#Fake news word cloud
wordDF %>%
  filter(target == "Fake") %>%
  select(-target) %>%
  arrange(desc(totCount)) %>%
  wordcloud2(shuffle=FALSE)

Seems Trump was a popular guy during this time frame. Real news also seems to report less on specific politicians than fake news does. Fake news leans on a select few words very heavily, while real news shows a lot more variety in its word cloud. There are also a lot more countries mentioned in real news, supporting the earlier observation that real news is more likely to report on world news than fake news is.

I also feel the need to call out that "fox" shows up pretty frequently in fake news. Given how often Fox's programming is characterized as entertainment rather than straight news, it could be possible to further flag fake news by looking for the names of entertainment-leaning outlets rather than established wire services.
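One quick way to eyeball that idea (just a sketch; the outlet list is an arbitrary hand-picked set, and only terms that survived the sparsity filter will appear) is to pull a few outlet names out of wordDF and compare them by news type:

#Sketch only: how prominent are a few outlet names in each news type?
#Note that totCount here is summed TF-IDF weight rather than a raw count
wordDF %>%
  filter(term %in% c("fox", "reuters", "cnn", "breitbart")) %>%
  arrange(term, target)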

#Top 10 words in both fake and real combined, and the comparison of them between the two
top10 <- wordDF %>%
  group_by(term) %>%
  mutate("orderCount" = sum(totCount)) %>%
  ungroup() %>%
  top_n(20, orderCount) %>% #Keep 20 rows, which is roughly the top 10 terms since each term usually has one row per news type
  arrange(desc(orderCount)) %>%
  ungroup()
orderTop <- top10 %>%
  distinct(term, orderCount)

orderList <- factor(unique(orderTop$term), ordered=TRUE, levels=unique(orderTop$term)) #Order the bar plot in descending order

top10 %>%
  plot_ly(x=~term, y=~totCount, color =~target, type='bar', colors=c("#2AB285", "#965483")) %>%
  layout(barmode = "stack",
         yaxis=list(title="Count"),
         xaxis=list(title="Word",
                    categoryorder = "array",
                    categoryarray = orderList
         ),
         title = "Overall Top 10 Frequency of words<br>What is the Distribution of the most Frequent Words?",
         legend=list(title=list(text='News Type'))
         
  )

The top 10 words show that both sources use the word "Trump" a similar number of times. Fake news tends to use individuals' names more often than real news, while real news uses party-related words like "party" and "republican" more often.

#Most frequent words in the real dataset, and how the fake dataset compares to them
topReal <- wordDF %>%
  filter(target == "Real") %>%
  top_n(10, totCount) %>%
  arrange(desc(totCount))


realBar <- wordDF %>%
  filter(term %in% c(topReal$term)) %>%
  arrange(desc(target), desc(totCount)) %>%
  mutate("term" = factor(term, levels=unique(term)), #Change to factors so Plotly knows what order to put them in
         "target" = factor(target, levels=c("Real", "Fake"))
  ) %>%
  plot_ly(x=~term, y=~totCount, color =~target, type='bar', colors=c("#965483", "#2AB285"), showlegend = FALSE) %>%
  layout(yaxis=list(title="Count"),
         xaxis=list(title="Word")
         #title = "Top 10 Frequency of words in Real News<br>How does Fake News Compare with the Most Frequent Words in Real News?",
         #legend=list(title=list(text='News Type'))
         
         
  )
  

#Most frequent words in the fake dataset, and how the real dataset compares to them
topFake <- wordDF %>%
  filter(target == "Fake") %>%
  top_n(10, totCount) %>%
  arrange(desc(totCount))
  
  
fakeBar <- wordDF %>%
  filter(term %in% c(topFake$term)) %>%
  arrange(desc(totCount)) %>%
  mutate("term" = factor(term, levels=unique(term))) %>% #Change to factors so Plotly knows what order to put them in 
  plot_ly(x=~term, y=~totCount, color =~target, type='bar', colors=c("#2AB285", "#965483")) %>%
  layout(yaxis=list(title="Count"),
         xaxis=list(title="Word"
         ),
         title = "Top 10 Most Frequent Words",
         legend=list(title=list(text='News Type'))
         
  )

subplot(realBar, fakeBar, shareY = TRUE) %>%
  layout(annotations = list( 
    list( 
      x = 0.2,  
      y = 1.0,  
      text = "Top 10 Words in Real News",  
      xref = "paper",  
      yref = "paper",  
      xanchor = "center",  
      yanchor = "bottom",  
      showarrow = FALSE 
    ),  
    list( 
      x = 0.8,  
      y = 1,  
      text = "Top 10 Words in Fake News",  
      xref = "paper",  
      yref = "paper",  
      xanchor = "center",  
      yanchor = "bottom",  
      showarrow = FALSE 
    ))
  )

Comparing the two in terms of word frequency, real news is fairly consistent in how often it uses its most common words, while, again, fake news has a strong tendency to use politicians' names. Real news also uses more "worldly" words while fake news tends to target specific groups and people (women, black, people, trump, obama, etc.).

So far, it seems our indicators of fake news are:

  • Overuse of politician names
  • Lack of reporting on world events
  • Potentially higher reporting frequency during important political events (intuitive, and real news is likely to do the same)
  • Targeting specific people or groups of people (black, woman, Muslim, etc.)

Building an RNN Model

RNNs are usually a decent choice for text processing. In NLP, words are not independent of one another, because sentences are structured in specific ways, so we need a model that can take those dependencies into account. The steps to set up the data for the model are as follows:

  1. Change the dtm into a matrix and identify our target variable
  2. Split the new matrix into training and test data
  3. Reshape the data to work with tensorflow's model creation
  4. Create the model and compile it with specific parameters
  5. Train the model

First, we’ll focus on splitting the data.

#Change dtm into a matrix
dtmMat <- as.matrix(dtmNorm) %>%
  scale()
targets <- cleanData$target # Define our target variable


#Split data into train and test data 70/30 split
indexSplit <- sample((1:dim(dtmMat)[1]), floor(dim(dtmMat)[1]*0.70)) #Random 70% of row indices; a set.seed() call beforehand would make the split reproducible
trainData <- dtmMat[indexSplit, ] #70% of the data
trainTarget <- targets[indexSplit]
testData <- dtmMat[-indexSplit, ] #30% of the data
testTarget <- targets[-indexSplit]

#Ensure splits are valid/even. The sums of each target type should align with the cleanData's target column
table(trainTarget)
## trainTarget
##     0     1 
## 15055 16373
table(testTarget)
## testTarget
##    0    1 
## 6362 7108
table(cleanData$target)
## 
##     0     1 
## 21417 23481
print(paste0("Real Aligns?: ", table(trainTarget)[1] + table(testTarget)[1] == table(cleanData$target)[1]))
## [1] "Real Aligns?: TRUE"
print(paste0("Fake Aligns?: ", table(trainTarget)[2] + table(testTarget)[2] == table(cleanData$target)[2]))
## [1] "Fake Aligns?: TRUE"

We can see by the output of the tables that our data was split properly. The sums of each split's target counts align with the initial cleaned data above.

Next, we can begin to create the model. Since this is a binary classification problem, the standard setup is a single neuron with a sigmoid activation in the final dense layer. More embedding dimensions are usually better, but memory limitations will keep this particular case under 300.
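For reference, a minimal sketch of that standard setup (a plain dense network on the TF-IDF matrix, single sigmoid output, binary cross-entropy loss) might look like the chunk below. It is illustrative only and not run here; it assumes the keras package is backed by a working TensorFlow install, and the hidden layer size is arbitrary.

# #Sketch only: the textbook binary-classifier shape described above
# sketchModel <- keras_model_sequential() %>%
#   layer_dense(units = 64, activation = "relu", input_shape = ncol(trainData)) %>%
#   layer_dense(units = 1, activation = "sigmoid") #Single sigmoid neuron gives a fake/real probability
# 
# sketchModel %>% compile(
#   loss = "binary_crossentropy",
#   optimizer = "adam",
#   metrics = c("accuracy")
# )
# 
# #sketchModel %>% fit(trainData, trainTarget, epochs = 5, batch_size = 32,
# #                    validation_data = list(testData, testTarget))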

# 
# # Reshape the input for RNN
# xTrain <- array(trainData, dim = c(nrow(trainData), ncol(trainData), 1)) #Documents is the number of rows, terms is the number of columns
# xTest <- array(testData, dim = c(nrow(testData), ncol(testData), 1))
# 
# # Define the RNN model architecture
# model <- keras_model_sequential() %>%
#   layer_embedding(input_dim = dim(trainData)[2], output_dim = 128, mask_zero = TRUE) %>% #unique words are the columns of the matrix. 128 dimensions because memory limits
#   layer_gru(64, return_sequences = TRUE) %>%
#   layer_lstm(units = 32, return_sequences = TRUE) %>%
#   #layer_lstm(units = 150, go_backwards = TRUE) %>%
#   layer_dense(units = 15, activation = "sigmoid") %>% #15 neurons in the dense layer, sigmoid for binary classifier
#   layer_dense(units = 1, activation = "relu") #Collapse to the target's shape; note that relu here, rather than sigmoid, is an unusual choice for a binary output
# 
# 
# # Compile the model
# model %>% compile(
#   loss = "binary_crossentropy",  # loss function
#   optimizer = "adam",  # optimizer
#   metrics = c("accuracy")  # metrics
# )
# 
# #Train the model
# history <- model %>% fit(
#   x = trainData,
#   y = trainTarget,
#   epochs = 5,
#   batch_size = 32, #We can figure out number of runs in each epoch by rounding nrow(xTrain)/32 up
#   shuffle = TRUE,
#   validation_data = list(testData, testTarget)  # Include validation data
# )
# # #gc()
# history
# lossTrain <- history$metrics$loss
# lossVal <- history$metrics$val_loss
# epochs <- seq_along(lossTrain)
# 
# # Plot the training and validation loss using plot_ly
# plot_ly() %>%
#   add_trace(x = epochs, y = lossTrain, name = "Training Loss", type = "scatter", mode = "lines", line = list(color = "#2AB285")) %>%
#   add_trace(x = epochs, y = lossVal, name = "Validation Loss", type = "scatter", mode = "lines", line = list(color = "#965483")) %>%
#   layout(title = "Training and Validation Loss",
#          xaxis = list(title = "Epoch"),
#          yaxis = list(title = "Loss"))

This is performing… less than ideally. Not much better than flipping a coin, really. Our loss is a flat line, which suggests this architecture is not a good fit for this dataset (feeding a TF-IDF matrix into an embedding layer that expects integer word indices likely doesn't help, nor does the relu on the output layer).

Let's try another approach: a Naive Bayes model.

Building a Naive Bayes Model

#---------Naive Bayes Model-----------
trainDataDF <- as.data.frame(trainData) %>%
  select(-target) %>% #The dtm happens to contain a term literally named "target"; drop it so it doesn't collide with the label added next
  mutate("target" = as.factor(trainTarget))

testDataDF <- as.data.frame(testData) %>%
  mutate("target" = as.factor(testTarget)) #Here the same-named term column gets overwritten by the label instead


names(trainDataDF) <- make.names(names(trainDataDF))
names(testDataDF) <- make.names(names(testDataDF))


nbModel <- naive_bayes(target ~ ., data = trainDataDF)
summary(nbModel)
## 
## ================================= Naive Bayes ================================== 
##  
## - Call: naive_bayes.formula(formula = target ~ ., data = trainDataDF) 
## - Laplace: 0 
## - Classes: 2 
## - Samples: 31428 
## - Features: 1472 
## - Conditional distributions: 
##     - Gaussian: 1472
## - Prior probabilities: 
##     - 0: 0.479
##     - 1: 0.521
## 
## --------------------------------------------------------------------------------
trainDataDF$nbPred <- predict(nbModel, type = 'class')
testDataDF$nbPred <- predict(nbModel, newdata = testDataDF)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
caret::confusionMatrix(testDataDF$target, testDataDF$nbPred) #Note: confusionMatrix() expects the predictions as its first argument, so the Prediction/Reference labels in the printout below are effectively swapped
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5886  476
##          1 1419 5689
##                                           
##                Accuracy : 0.8593          
##                  95% CI : (0.8533, 0.8651)
##     No Information Rate : 0.5423          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7199          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8057          
##             Specificity : 0.9228          
##          Pos Pred Value : 0.9252          
##          Neg Pred Value : 0.8004          
##              Prevalence : 0.5423          
##          Detection Rate : 0.4370          
##    Detection Prevalence : 0.4723          
##       Balanced Accuracy : 0.8643          
##                                           
##        'Positive' Class : 0               
## 

This performs much better, with an accuracy of around 86%. With more computing power, we could optimize parameters further and try something like a grid search over a support vector machine.
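If the memory and time were available, that grid search might look roughly like the sketch below (not run here; it assumes the e1071 package and uses an arbitrary, small cost grid):

# #Sketch only: grid search over an SVM's cost parameter with e1071
# library(e1071)
# svmTune <- tune(svm, target ~ ., data = trainDataDF %>% select(-nbPred), #drop the NB predictions added above so they don't leak in as a feature
#                 ranges = list(cost = c(0.1, 1, 10)),
#                 tunecontrol = tune.control(cross = 5)) #5-fold cross-validation over the grid
# summary(svmTune)
# svmPred <- predict(svmTune$best.model, newdata = testDataDF %>% select(-nbPred, -target))
# caret::confusionMatrix(svmPred, testDataDF$target)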

However, there are some caveats, the biggest being that ALL of the real data comes from Reuters. If another outlet has a different writing style (very likely), it is quite possible its articles would still be labeled fake simply because they don't use verbiage similar to Reuters'. The fact that our specificity runs a bit higher than our sensitivity suggests this may well already be the case.
